fix: remove dangling embedding entries #824
Open
JamesGuthrie wants to merge 7 commits into jg/split-tx-processing from
* chore: split deps into separate extras
* chore: update docs and dockerfile
* chore: only install vectorizer-worker extra in dockerfile
The vectorizer worker used to process queue items in a single transaction. If any step (other than file loading) failed, processing was aborted and retried later. Because everything ran in a single transaction, failed attempts were not recorded in the queue table.

This change rearchitects queue item processing to consist of two transactions:

- The "fetch work" transaction gets a batch of rows from the database for processing. It updates the `attempts` column of those rows to signal that an attempt has been made to process the item. It deletes duplicate queue items for the same primary key columns.
- The "embed and write" transaction performs embedding, writes the embeddings to the database, and removes successfully processed queue rows. Rows which failed to be processed have the `retry_after` column set to a value proportional to the number of existing attempts.

When the `attempts` column goes over a predefined threshold (6), the queue item is moved to the "failed" (dead letter) queue.
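The two-transaction flow described above can be sketched in memory as follows. This is an illustration only, not the worker's actual code: `QueueItem`, `fetch_work`, `embed_and_write`, and `BASE_DELAY` are hypothetical names, plain Python lists stand in for the queue and dead-letter tables, and the point at which the threshold is checked (after a failed attempt) is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

MAX_ATTEMPTS = 6                   # threshold named in the commit message
BASE_DELAY = timedelta(minutes=1)  # hypothetical backoff unit, not from the PR


@dataclass(eq=False)  # identity comparison, so duplicates stay distinct objects
class QueueItem:
    pk: tuple
    attempts: int = 0
    retry_after: Optional[datetime] = None


def fetch_work(queue: list, batch_size: int, now: datetime) -> list:
    """'Fetch work': select ready rows, bump attempts, delete duplicate PKs."""
    batch, seen = [], set()
    for item in list(queue):
        if item.pk in seen:
            queue.remove(item)  # duplicate of a row already in this batch
        elif len(batch) < batch_size and (
            item.retry_after is None or item.retry_after <= now
        ):
            item.attempts += 1  # recorded even if the embedding step later fails
            seen.add(item.pk)
            batch.append(item)
    return batch


def embed_and_write(batch: list, queue: list, failed: list, embeddings: dict,
                    embed: Callable, now: datetime) -> None:
    """'Embed and write': store embeddings, then retire, retry, or dead-letter."""
    for item in batch:
        try:
            embeddings[item.pk] = embed(item.pk)
            queue.remove(item)  # success: the queue row is removed
        except Exception:
            if item.attempts > MAX_ATTEMPTS:
                queue.remove(item)
                failed.append(item)  # moved to the "failed" (dead letter) queue
            else:
                # delay proportional to the number of attempts so far
                item.retry_after = now + item.attempts * BASE_DELAY
```

Because `attempts` is incremented in the first step and kept regardless of what happens in the second, a failed embedding call no longer erases the record that an attempt was made.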
There is no FK from the queue table to the source table, or from the embedding table to the source table. As a result, we can have "dangling" entries in both the queue table and the embedding table. These occur because of a race condition between the deletion of a source table row and the processing of the embeddings for that row.

To "clean up" dangling embeddings, we insert a queue row when a source table row is deleted. This queue row is dangling, because the source row it refers to no longer exists.

When a dangling queue row is identified, we use the PK values of the queue row to remove all associated embeddings (if present).

Co-authored-by: Jascha <jascha@timescale.com>
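A minimal in-memory sketch of the cleanup idea, under stated assumptions: dictionaries keyed by primary key stand in for the source and embedding tables, a list stands in for the queue table, and `delete_source_row` and `process_queue` are hypothetical names rather than functions from this PR.

```python
def delete_source_row(source: dict, queue: list, pk: tuple) -> None:
    """Delete a source row and enqueue its PK (stand-in for the insert on delete)."""
    source.pop(pk, None)
    queue.append(pk)  # dangling by construction: the source row is already gone


def process_queue(source: dict, queue: list, embeddings: dict, embed) -> None:
    """Embed rows that still exist; for dangling queue rows, drop their embeddings."""
    while queue:
        pk = queue.pop(0)
        row = source.get(pk)
        if row is None:
            # Dangling queue row: remove all embeddings for this PK, if present.
            embeddings.pop(pk, None)
        else:
            embeddings[pk] = embed(row)
```

In this sketch, an embedding written by a worker that raced with the delete is removed the next time the queue is drained, because the delete itself left a queue row behind.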